1. Exploring and cleaning the data

uni <- read.table(file="/Volumes/SD/GoogleDrive/0.DA/Assignments/2/uni.csv",
                  header=TRUE, sep=";", fill = TRUE, quote="\"")
head(uni)
##                           Name Private Apps Accept Enroll Top10perc
## 1 Abilene Christian University     Yes 1660   1232    721        23
## 2           Adelphi University     Yes 2186   1924    512        16
## 3               Adrian College     Yes 1428   1097    336        22
## 4          Agnes Scott College     Yes  417    349    137        60
## 5    Alaska Pacific University     Yes  193    146     55        16
## 6            Albertson College     Yes  587    479    158        38
##   Top25perc F.Undergrad P.Undergrad Outstate Room.Board Books Personal PhD
## 1        52        2885         537     7440       3300   450     2200  70
## 2        29        2683        1227    12280       6450   750     1500  29
## 3        50        1036          99    11250       3750   400     1165  53
## 4        89         510          63    12960       5450   450      875  92
## 5        44         249         869     7560       4120   800     1500  76
## 6        62         678          41    13500       3335   500      675  67
##   Terminal S.F.Ratio perc.alumni Expend Grad.Rate
## 1       78      18.1          12   7041        60
## 2       30      12.2          16  10527        56
## 3       66      12.9          30   8735        54
## 4       97       7.7          37  19016        59
## 5       72      11.9           2  10922        15
## 6       73       9.4          11   9727        55

We will first check the distribution of each variable and correct possible outliers.

summary(uni)
##                            Name     Private        Apps      
##  Abilene Christian University:  1   NB :  4   Min.   :   81  
##  Adelphi University          :  1   No :212   1st Qu.:  776  
##  Adrian College              :  1   Yes:561   Median : 1558  
##  Agnes Scott College         :  1             Mean   : 3002  
##  Alaska Pacific University   :  1             3rd Qu.: 3624  
##  Albertson College           :  1             Max.   :48094  
##  (Other)                     :771                            
##      Accept          Enroll         Top10perc       Top25perc    
##  Min.   :   72   Min.   :  35.0   Min.   : 1.00   Min.   :  9.0  
##  1st Qu.:  604   1st Qu.: 243.0   1st Qu.:15.00   1st Qu.: 41.0  
##  Median : 1110   Median : 437.0   Median :23.00   Median : 54.0  
##  Mean   : 2019   Mean   : 794.1   Mean   :27.56   Mean   : 55.8  
##  3rd Qu.: 2424   3rd Qu.: 904.0   3rd Qu.:35.00   3rd Qu.: 69.0  
##  Max.   :26330   Max.   :7157.0   Max.   :96.00   Max.   :100.0  
##                                                                  
##   F.Undergrad     P.Undergrad         Outstate       Room.Board  
##  Min.   :  139   Min.   :    1.0   Min.   : 2340   Min.   :1780  
##  1st Qu.:  992   1st Qu.:   95.0   1st Qu.: 7320   1st Qu.:3597  
##  Median : 1707   Median :  353.0   Median : 9990   Median :4200  
##  Mean   : 3700   Mean   :  855.3   Mean   :10441   Mean   :4358  
##  3rd Qu.: 4005   3rd Qu.:  967.0   3rd Qu.:12925   3rd Qu.:5050  
##  Max.   :31643   Max.   :21836.0   Max.   :21700   Max.   :8124  
##                                                                  
##      Books           Personal         PhD            Terminal    
##  Min.   :  96.0   Min.   : 250   Min.   :  8.00   Min.   : 24.0  
##  1st Qu.: 470.0   1st Qu.: 850   1st Qu.: 62.00   1st Qu.: 71.0  
##  Median : 500.0   Median :1200   Median : 75.00   Median : 82.0  
##  Mean   : 549.4   Mean   :1341   Mean   : 72.66   Mean   : 79.7  
##  3rd Qu.: 600.0   3rd Qu.:1700   3rd Qu.: 85.00   3rd Qu.: 92.0  
##  Max.   :2340.0   Max.   :6800   Max.   :103.00   Max.   :100.0  
##                                                                  
##    S.F.Ratio       perc.alumni        Expend        Grad.Rate     
##  Min.   :  2.50   Min.   : 0.00   Min.   : 3186   Min.   : 10.00  
##  1st Qu.: 11.50   1st Qu.:13.00   1st Qu.: 6751   1st Qu.: 53.00  
##  Median : 13.60   Median :21.00   Median : 8377   Median : 65.00  
##  Mean   : 14.22   Mean   :22.74   Mean   : 9660   Mean   : 65.47  
##  3rd Qu.: 16.50   3rd Qu.:31.00   3rd Qu.:10830   3rd Qu.: 78.00  
##  Max.   :122.10   Max.   :64.00   Max.   :56233   Max.   :118.00  
##                                                   NA's   :3

We first noticed that the Private column is encoded as Yes, No or NB. We will turn the Yes values into 1; all others will be 0.

uni$Private <- as.integer(uni$Private == "Yes")
head(uni)
##                           Name Private Apps Accept Enroll Top10perc
## 1 Abilene Christian University       1 1660   1232    721        23
## 2           Adelphi University       1 2186   1924    512        16
## 3               Adrian College       1 1428   1097    336        22
## 4          Agnes Scott College       1  417    349    137        60
## 5    Alaska Pacific University       1  193    146     55        16
## 6            Albertson College       1  587    479    158        38
##   Top25perc F.Undergrad P.Undergrad Outstate Room.Board Books Personal PhD
## 1        52        2885         537     7440       3300   450     2200  70
## 2        29        2683        1227    12280       6450   750     1500  29
## 3        50        1036          99    11250       3750   400     1165  53
## 4        89         510          63    12960       5450   450      875  92
## 5        44         249         869     7560       4120   800     1500  76
## 6        62         678          41    13500       3335   500      675  67
##   Terminal S.F.Ratio perc.alumni Expend Grad.Rate
## 1       78      18.1          12   7041        60
## 2       30      12.2          16  10527        56
## 3       66      12.9          30   8735        54
## 4       97       7.7          37  19016        59
## 5       72      11.9           2  10922        15
## 6       73       9.4          11   9727        55
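To see why the recoding one-liner works: comparing a factor to "Yes" yields a logical vector, and as.integer() maps TRUE/FALSE to 1/0 (so No and NB both become 0). A tiny illustration with made-up values:

```r
# Comparing a factor to "Yes" gives TRUE/FALSE; as.integer() turns that into 1/0.
priv <- factor(c("Yes", "No", "NB", "Yes"))
as.integer(priv == "Yes")
## [1] 1 0 0 1
```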
require(ggplot2)
require(data.table)

# melt() here comes from data.table; keep the Name column as the id variable
# so that only the numeric columns are stacked into histograms
ggplot(data = melt(as.data.table(uni), id.vars = "Name"), mapping = aes(x = value)) +
  geom_histogram(bins = 20) + facet_wrap(~variable, scales = 'free_x')

There are unusual values in several columns (P.Undergrad, S.F.Ratio and Books), which we will handle by replacing them with the median of the respective column. The values might actually be plausible, but they are such extreme outliers that it is safer to replace them.

uni$Books[(uni$Books > 1500)] <- median(uni$Books, na.rm=TRUE)
uni$P.Undergrad[(uni$P.Undergrad > 20000)] <- median(uni$P.Undergrad, na.rm=TRUE)
uni$S.F.Ratio[(uni$S.F.Ratio > 100)] <- median(uni$S.F.Ratio, na.rm=TRUE)

We also have to fix NA values:

uni$Grad.Rate[is.na(uni$Grad.Rate)] <- median(uni$Grad.Rate, na.rm=TRUE)

2. Linear regression for single predictor

First we will remove non-numeric columns like the University name from the dataframe:

uni = uni[,-1]

Analyzing correlations

Now we can start the analysis by printing and plotting the correlations of all variables with the number of applications (Apps):

print(cor(uni)[,2])
##     Private        Apps      Accept      Enroll   Top10perc   Top25perc 
## -0.42196498  1.00000000  0.94345057  0.80722854  0.33883368  0.35163990 
## F.Undergrad P.Undergrad    Outstate  Room.Board       Books    Personal 
##  0.81449058  0.41478874  0.05015903  0.16493896  0.16083904  0.17873085 
##         PhD    Terminal   S.F.Ratio perc.alumni      Expend   Grad.Rate 
##  0.39069733  0.36949147  0.09529911 -0.09022589  0.25959198  0.14682609

We can also create a barplot of the correlations:

barplot(cor(uni)[,2], width = 30)

An important trend to notice is that most of the variables are positively correlated with the number of applications; the one notable exception is whether the university is private. Another, smaller exception is the perc.alumni column, suggesting that schools with more alumni who donate receive fewer applications.

Fitting linear models

Now we are going to fit a linear model for every single predictor in X (i.e. each variable/column) to predict the y value (Apps).
We will also create a plot for each of those, showing its predictive capabilities.
It's important to understand that we don't expect very good results from this experiment, because each model uses only a single piece of information.
We do expect that positively correlated predictors will have upward slopes, while negatively correlated predictors (such as Private) will have downward slopes.

for(i in names(uni[,-2])){          # loop over every column except Apps
  mod1 <- lm(Apps ~ get(i), data=uni)
  x <- uni[[i]]
  y <- uni$Apps
  plot(x, y, xlim=c(min(x)-1, max(x)+1), ylim=c(min(y)-10, max(y)+10),
       xlab = i, ylab = 'applications')
  abline(mod1, lwd=2)
}

Conclusion

We can see that variables with a negative correlation to the number of applications also had downward (negatively) sloped regression lines (i.e. negative coefficients).
On the other hand, all positively correlated variables had upward-sloping models (i.e. positive coefficients).
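This is not a coincidence: the slope of a simple regression equals cor(x, y) * sd(y) / sd(x), so its sign always matches the sign of the correlation. A quick check of this identity on the built-in mtcars data (not the uni data):

```r
# Simple-regression slope vs. correlation, on built-in mtcars data:
# slope = cor(x, y) * sd(y) / sd(x), so the signs always agree.
y <- mtcars$mpg
for (v in c("wt", "hp", "drat")) {
  x <- mtcars[[v]]
  slope <- unname(coef(lm(y ~ x))[2])
  stopifnot(isTRUE(all.equal(slope, cor(x, y) * sd(y) / sd(x))))
  stopifnot(sign(slope) == sign(cor(x, y)))
}
```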

3. Fitting multiple regression model

General approach

Now we will try to predict the number of applications using all of the variables (predictors).
First we will fit a linear model.
Then we are going to analyze its coefficients to determine which predictors are useful, and then the correlations between predictors.

lm.fit = lm(Apps~., data=uni)

After we fit the model we will explore its internals and metrics:

summary(lm.fit, correlation=TRUE)
## 
## Call:
## lm(formula = Apps ~ ., data = uni)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5005.3  -433.0   -34.0   324.2  8702.1 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -570.62890  410.82396  -1.389 0.165244    
## Private     -495.26757  136.23366  -3.635 0.000296 ***
## Accept         1.51537    0.03738  40.535  < 2e-16 ***
## Enroll        -0.28673    0.10907  -2.629 0.008742 ** 
## Top10perc     47.27397    5.61554   8.418  < 2e-16 ***
## Top25perc    -13.05656    4.52120  -2.888 0.003989 ** 
## F.Undergrad   -0.01738    0.02575  -0.675 0.500052    
## P.Undergrad    0.02050    0.03925   0.522 0.601573    
## Outstate      -0.08646    0.01921  -4.501 7.82e-06 ***
## Room.Board     0.17176    0.04871   3.526 0.000447 ***
## Books          0.20364    0.28106   0.725 0.468963    
## Personal       0.02556    0.06419   0.398 0.690639    
## PhD           -7.89128    4.65808  -1.694 0.090655 .  
## Terminal      -3.80930    5.14388  -0.741 0.459196    
## S.F.Ratio     14.07978   13.14300   1.071 0.284386    
## perc.alumni   -0.38949    4.17088  -0.093 0.925623    
## Expend         0.07741    0.01249   6.196 9.48e-10 ***
## Grad.Rate      7.94597    2.98554   2.661 0.007944 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1053 on 759 degrees of freedom
## Multiple R-squared:  0.9276, Adjusted R-squared:  0.9259 
## F-statistic: 571.6 on 17 and 759 DF,  p-value: < 2.2e-16
## 
## Correlation of Coefficients:
##             (Intercept) Private Accept Enroll Top10perc Top25perc
## Private     -0.28                                                
## Accept       0.14        0.04                                    
## Enroll      -0.09       -0.03   -0.34                            
## Top10perc    0.27       -0.01    0.12  -0.06                     
## Top25perc   -0.19        0.00   -0.06   0.05  -0.80              
## F.Undergrad  0.02        0.18   -0.41  -0.60  -0.05     -0.04    
## P.Undergrad -0.01        0.04    0.07   0.09   0.08     -0.01    
## Outstate     0.06       -0.33   -0.17   0.05  -0.05      0.00    
## Room.Board  -0.17       -0.09   -0.10   0.03   0.04      0.01    
## Books       -0.22       -0.05   -0.01  -0.01  -0.04     -0.04    
## Personal    -0.28        0.00    0.06   0.01  -0.01      0.01    
## PhD         -0.01        0.12   -0.02  -0.04  -0.12     -0.01    
## Terminal    -0.25        0.10    0.01   0.06   0.11     -0.12    
## S.F.Ratio   -0.60        0.15   -0.02   0.01   0.00      0.03    
## perc.alumni -0.04       -0.10    0.16  -0.09  -0.02     -0.06    
## Expend      -0.22        0.08   -0.04  -0.01  -0.38      0.21    
## Grad.Rate   -0.25       -0.07   -0.13   0.04  -0.07     -0.08    
##             F.Undergrad P.Undergrad Outstate Room.Board Books Personal
## Private                                                               
## Accept                                                                
## Enroll                                                                
## Top10perc                                                             
## Top25perc                                                             
## F.Undergrad                                                           
## P.Undergrad -0.33                                                     
## Outstate     0.09        0.01                                         
## Room.Board   0.05       -0.14       -0.34                             
## Books        0.00        0.00        0.04    -0.10                    
## Personal    -0.11       -0.09        0.07     0.07      -0.21         
## PhD          0.04       -0.05       -0.06     0.02       0.09 -0.04   
## Terminal    -0.08       -0.02       -0.09    -0.11      -0.09  0.04   
## S.F.Ratio   -0.06        0.00        0.08     0.01      -0.05  0.07   
## perc.alumni -0.01        0.10       -0.20     0.15       0.02  0.09   
## Expend      -0.01        0.01       -0.28    -0.11      -0.04 -0.05   
## Grad.Rate    0.01        0.13       -0.14    -0.13       0.04  0.07   
##             PhD   Terminal S.F.Ratio perc.alumni Expend
## Private                                                
## Accept                                                 
## Enroll                                                 
## Top10perc                                              
## Top25perc                                              
## F.Undergrad                                            
## P.Undergrad                                            
## Outstate                                               
## Room.Board                                             
## Books                                                  
## Personal                                               
## PhD                                                    
## Terminal    -0.73                                      
## S.F.Ratio   -0.06  0.02                                
## perc.alumni  0.01 -0.08     0.06                       
## Expend      -0.02 -0.05     0.37     -0.02             
## Grad.Rate   -0.05  0.05    -0.03     -0.20        0.09

We can observe several things:

  1. The Coefficients statistics tell us how significant each predictor is. A small p-value indicates that it is unlikely we would observe such a relationship between the predictor and the response purely by chance. For the predictors whose p-value is below 0.01 we can say that there is a relationship between the predictor (x) and the response (y), i.e. we can reject the null hypothesis of no relationship.
    By this p-value criterion the most influential predictors are: Private, Accept, Enroll, Top10perc, Top25perc, Outstate, Room.Board, Expend and Grad.Rate. For all other predictors we fail to reject the null hypothesis that there is no relationship between the predictor and the response.

  2. By analyzing the Correlation of Coefficients we can observe that, as expected, the predictors (and their coefficients) are not uncorrelated but rather mostly slightly correlated.

  3. The R-squared value of 0.9276 is a solid result, as it is close to the ideal score of 1. The value is somewhat optimistic, however, because it is computed on the same data the model was trained on.

  4. The F-statistic has a very low p-value, meaning that the null hypothesis that none of the predictors is useful is very unlikely. In other words, at least one of the predictors is useful in predicting the response.
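The p-value screening from point 1 can also be done programmatically by filtering the coefficient table returned by summary(). A sketch using the built-in mtcars data as a stand-in (not the uni model):

```r
# Keep predictors whose p-value in the coefficient table is below 0.01.
fit <- lm(mpg ~ wt + hp, data = mtcars)
coefs <- summary(fit)$coefficients
significant <- setdiff(rownames(coefs)[coefs[, "Pr(>|t|)"] < 0.01], "(Intercept)")
print(significant)
```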

Variable subset approach

We can use the leaps package to find the best subset of up to 5 variables using forward feature selection.

library(leaps)
regfit.fwd=regsubsets(Apps~.,data=uni,nvmax=5,method="forward")

summary(regfit.fwd)
## Subset selection object
## Call: regsubsets.formula(Apps ~ ., data = uni, nvmax = 5, method = "forward")
## 17 Variables  (and intercept)
##             Forced in Forced out
## Private         FALSE      FALSE
## Accept          FALSE      FALSE
## Enroll          FALSE      FALSE
## Top10perc       FALSE      FALSE
## Top25perc       FALSE      FALSE
## F.Undergrad     FALSE      FALSE
## P.Undergrad     FALSE      FALSE
## Outstate        FALSE      FALSE
## Room.Board      FALSE      FALSE
## Books           FALSE      FALSE
## Personal        FALSE      FALSE
## PhD             FALSE      FALSE
## Terminal        FALSE      FALSE
## S.F.Ratio       FALSE      FALSE
## perc.alumni     FALSE      FALSE
## Expend          FALSE      FALSE
## Grad.Rate       FALSE      FALSE
## 1 subsets of each size up to 5
## Selection Algorithm: forward
##          Private Accept Enroll Top10perc Top25perc F.Undergrad P.Undergrad
## 1  ( 1 ) " "     "*"    " "    " "       " "       " "         " "        
## 2  ( 1 ) " "     "*"    " "    "*"       " "       " "         " "        
## 3  ( 1 ) " "     "*"    " "    "*"       " "       " "         " "        
## 4  ( 1 ) " "     "*"    " "    "*"       " "       " "         " "        
## 5  ( 1 ) " "     "*"    "*"    "*"       " "       " "         " "        
##          Outstate Room.Board Books Personal PhD Terminal S.F.Ratio
## 1  ( 1 ) " "      " "        " "   " "      " " " "      " "      
## 2  ( 1 ) " "      " "        " "   " "      " " " "      " "      
## 3  ( 1 ) " "      " "        " "   " "      " " " "      " "      
## 4  ( 1 ) "*"      " "        " "   " "      " " " "      " "      
## 5  ( 1 ) "*"      " "        " "   " "      " " " "      " "      
##          perc.alumni Expend Grad.Rate
## 1  ( 1 ) " "         " "    " "      
## 2  ( 1 ) " "         " "    " "      
## 3  ( 1 ) " "         "*"    " "      
## 4  ( 1 ) " "         "*"    " "      
## 5  ( 1 ) " "         "*"    " "

We can see that the top 5 features are chosen in the following order of importance:
1. Accept
2. Top10perc
3. Expend
4. Outstate
5. Enroll

Now we can also check the coefficients of the model:

coef(regfit.fwd, 5)
##   (Intercept)        Accept        Enroll     Top10perc      Outstate 
## -554.38069351    1.53453168   -0.33517131   32.13461106   -0.09057353 
##        Expend 
##    0.08119256

To conclude, the forward feature selection procedure found the subset of most influential variables: Accept, Top10perc, Expend, Outstate and Enroll.

4. Comparing the results of simple and multiple regression

When analyzing the performance of linear models built on a single predictor we concluded that the results closely track the correlations between the individual variables and the response.
We will again compute the correlations between each variable and the number of applications, and filter out correlations with an absolute value below 0.35, as they are weak compared to the others.

d = cor(uni)[,2]     # correlations with Apps
d = d[-2]            # drop the Apps self-correlation
d[abs(d) > 0.35]     # keep only the stronger correlations
##     Private      Accept      Enroll   Top25perc F.Undergrad P.Undergrad 
##  -0.4219650   0.9434506   0.8072285   0.3516399   0.8144906   0.4147887 
##         PhD    Terminal 
##   0.3906973   0.3694915

From this list we find that the most correlated single variables are: Private, Accept, Enroll, Top25perc, F.Undergrad, P.Undergrad, PhD, Terminal.

Now we will compare this with the most influential variables based on their standard error and p-value in the multiple regression model. We had already determined those variables: Private, Accept, Enroll, Top10perc, Top25perc, Outstate, Room.Board, Expend and Grad.Rate.
Furthermore, using feature selection on the multiple regression model we determined that the 5 most informative variables were Accept, Top10perc, Expend, Outstate and Enroll.

We can notice that the variables Private, Accept, Enroll and Top25perc show up both in the simple single-variable correlations and in the more advanced models, suggesting that they are the predictors with the strongest relationship to the number of applications.

This makes a lot of sense, as the number of applicants accepted, the number of new students enrolled and the number of new students from the top 25% of their high school class are naturally strongly correlated with the number of applications, because the students who applied to the university are also included in those numbers.

5. Evaluating multiple regression models

To properly evaluate our model we have to test it on data it wasn't trained on; if we test the model on the same data points it was trained on, it will likely produce unrealistically good results.

We will first split our data into a training dataset (75%) and a testing dataset (25%).
We will train our model on the 75% of the data and test it on the other 25%.
To analyze its performance we will compute the R-squared and RMSE metrics.
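For reference, the two metrics written out by hand (caret's RMSE() and R2() wrap essentially these computations; note that caret's R2() defaults to the squared correlation between observed and predicted values):

```r
# Hand-rolled versions of the two evaluation metrics.
rmse_manual <- function(obs, pred) sqrt(mean((obs - pred)^2))
r2_manual   <- function(obs, pred) cor(obs, pred)^2

# Tiny sanity check on made-up numbers:
obs  <- c(1, 2, 3, 4)
pred <- c(1.1, 1.9, 3.2, 3.8)
stopifnot(abs(rmse_manual(obs, pred) - sqrt(mean(c(-0.1, 0.1, -0.2, 0.2)^2))) < 1e-12)
```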

We will also evaluate the model on all of the training data using 10-fold cross-validation. This process is very useful, for example, for finding the best hyperparameters of a model.

We will be using the caret and leaps packages for these calculations.

First we split the dataset into training and testing indices:

library(caret)
library(leaps)
set.seed(41)
train <- sample(seq_len(nrow(uni)), size = floor(0.75 * nrow(uni)))
test <- (-train)
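The `test <- (-train)` trick relies on R's negative indexing: negative indices drop elements, so `uni[-train, ]` keeps exactly the rows that were not sampled for training. A minimal demonstration on a toy vector:

```r
# Negative indices drop elements, so x[-train_idx] is the complement of the sample.
x <- 1:10
train_idx <- c(2, 5, 7)
stopifnot(all(x[-train_idx] == c(1, 3, 4, 6, 8, 9, 10)))
```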

Now let’s train a multiple linear regression model on our training data.
We will also summarize the model to quickly go over its metrics.

lm.fit = lm(Apps~.,data=uni[train,])
print(summary(lm.fit))
## 
## Call:
## lm(formula = Apps ~ ., data = uni[train, ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5249.4  -414.5   -43.1   293.8  7510.0 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.258e+02  4.845e+02  -0.260   0.7953    
## Private     -6.143e+02  1.565e+02  -3.924 9.78e-05 ***
## Accept       1.639e+00  4.363e-02  37.565  < 2e-16 ***
## Enroll      -9.879e-01  2.210e-01  -4.470 9.45e-06 ***
## Top10perc    3.944e+01  6.322e+00   6.239 8.66e-10 ***
## Top25perc   -8.948e+00  5.096e+00  -1.756   0.0797 .  
## F.Undergrad  6.114e-02  3.894e-02   1.570   0.1169    
## P.Undergrad  6.027e-03  4.746e-02   0.127   0.8990    
## Outstate    -9.830e-02  2.235e-02  -4.398 1.31e-05 ***
## Room.Board   1.208e-01  5.658e-02   2.136   0.0331 *  
## Books        1.224e-01  3.240e-01   0.378   0.7057    
## Personal    -3.030e-03  7.219e-02  -0.042   0.9665    
## PhD         -1.200e+01  5.368e+00  -2.236   0.0257 *  
## Terminal    -1.091e-01  5.895e+00  -0.019   0.9852    
## S.F.Ratio    3.775e+00  1.627e+01   0.232   0.8166    
## perc.alumni  1.730e-01  4.866e+00   0.036   0.9717    
## Expend       9.355e-02  1.868e-02   5.008 7.37e-07 ***
## Grad.Rate    8.744e+00  3.489e+00   2.506   0.0125 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1033 on 564 degrees of freedom
## Multiple R-squared:  0.9368, Adjusted R-squared:  0.9349 
## F-statistic: 491.8 on 17 and 564 DF,  p-value: < 2.2e-16

Now we will evaluate the model on our testing data according to R-squared and RMSE metrics:

lm_pred <- predict(lm.fit, uni[test,])
lm_rmse <- RMSE(uni$Apps[test],lm_pred)
lm_r2 <- R2(uni$Apps[test],lm_pred)

sprintf('RMSE = %f',lm_rmse)
## [1] "RMSE = 1239.317749"
sprintf('r-squared = %f',lm_r2)
## [1] "r-squared = 0.864277"

Using 10-fold cross-validation we can also estimate how good our model is by training it 10 times, each time on 90% of the training data, and testing on the remaining 10%:

train_control <- trainControl(method="cv", number=10, verboseIter = TRUE)
# train the model
model_lm_cv <- train(Apps~., data=uni[train,], trControl=train_control, method="lm")
## + Fold01: intercept=TRUE 
## - Fold01: intercept=TRUE 
## + Fold02: intercept=TRUE 
## - Fold02: intercept=TRUE 
## + Fold03: intercept=TRUE 
## - Fold03: intercept=TRUE 
## + Fold04: intercept=TRUE 
## - Fold04: intercept=TRUE 
## + Fold05: intercept=TRUE 
## - Fold05: intercept=TRUE 
## + Fold06: intercept=TRUE 
## - Fold06: intercept=TRUE 
## + Fold07: intercept=TRUE 
## - Fold07: intercept=TRUE 
## + Fold08: intercept=TRUE 
## - Fold08: intercept=TRUE 
## + Fold09: intercept=TRUE 
## - Fold09: intercept=TRUE 
## + Fold10: intercept=TRUE 
## - Fold10: intercept=TRUE 
## Aggregating results
## Fitting final model on full training set

We can now analyze the results of Cross Validation found in the results field of the trained model:

print(model_lm_cv$results)
##   intercept     RMSE  Rsquared      MAE   RMSESD RsquaredSD    MAESD
## 1      TRUE 1099.672 0.9264773 630.8736 399.7144 0.06174034 91.28056

The cross-validation scores are more pessimistic than the training-set summary because each fold's model is trained on only 90% of the training data and is evaluated on held-out observations.

The final model produced by cross-validation is refit on the whole training set, so its predictions on the test data are the same as with our first model:

pred2 = predict(model_lm_cv,uni[test,])
rmse2 <- RMSE(uni$Apps[test],pred2)
r22 <- R2(uni$Apps[test],pred2)
sprintf('RMSE = %f',rmse2)
## [1] "RMSE = 1239.317749"
sprintf('r-squared = %f',r22)
## [1] "r-squared = 0.864277"

6. Ridge and Lasso regression

We will use the glmnet library to fit Ridge and Lasso regression models onto our training data.
Based on the cross validation results we will choose the best lambda coefficient for our model.
We will also plot how the number of variables with nonzero coefficients and the mean squared error (MSE) change with the lambda value.

library(glmnet)
grid=10^seq(10,-2,length=100)

x=model.matrix(Apps~.,uni)[,-1]   # drop the intercept column, keep all predictors
y=uni$Apps

ridge.mod=glmnet(x[train,],y[train],alpha=0,lambda=grid, thresh=1e-12)
ridge_cv.out=cv.glmnet(x[train,],y[train],alpha=0)
plot(ridge_cv.out)

On the chart, the bottom X axis shows the logarithm of lambda, the top X axis shows the number of variables (predictors) with nonzero coefficients, and the Y axis shows the mean squared error of the model.

We can notice along the top of the plot that even at the largest lambda values, all variables still keep nonzero coefficients. This is expected for Ridge regression, which shrinks coefficients toward zero but never exactly to zero; we would expect different behaviour from Lasso regression.
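This shrink-but-never-zero behaviour can be illustrated with base R alone via the closed-form ridge solution (a sketch on synthetic data, not using glmnet or the uni data):

```r
# Closed-form ridge regression on standardized synthetic data:
# coefficients shrink toward zero as lambda grows, but never reach exactly zero.
set.seed(1)
n <- 100; p <- 3
X <- scale(matrix(rnorm(n * p), n, p))
y <- as.vector(X %*% c(2, -1, 0.5) + rnorm(n))
ridge_coef <- function(lambda) solve(t(X) %*% X + lambda * diag(p), t(X) %*% y)
b_small <- ridge_coef(0.1)
b_large <- ridge_coef(1000)
stopifnot(all(abs(b_large) < abs(b_small)))  # shrunk toward zero ...
stopifnot(all(b_large != 0))                 # ... but still nonzero
```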

Now we will find the best lambda value:

bestlam_ridge = ridge_cv.out$lambda.min
print(bestlam_ridge)
## [1] 422.9006

We will also evaluate the model on our testing data according to R-squared and RMSE metrics:

ridge_pred=predict(ridge.mod,s=bestlam_ridge,newx=x[test,])
ridge_rmse = RMSE(y[test],ridge_pred)
ridge_r2 = R2(y[test],ridge_pred)[1]
sprintf('RMSE = %f',ridge_rmse)
## [1] "RMSE = 1095.329008"
sprintf('r-squared = %f',ridge_r2)
## [1] "r-squared = 0.888471"

Now let’s fit a Lasso model and choose the best lambda value using cross validation.

lasso.mod=glmnet(x[train,],y[train],alpha=1,lambda=grid)
lasso_cv.out=cv.glmnet(x[train,],y[train],alpha=1)
plot(lasso_cv.out)

bestlam=lasso_cv.out$lambda.min
print(bestlam)
## [1] 23.1
lasso_pred=predict(lasso.mod,s=bestlam,newx=x[test,])
lasso_rmse = RMSE(y[test],lasso_pred)
lasso_r2 = R2(y[test],lasso_pred)[1]
sprintf('RMSE = %f',lasso_rmse)
## [1] "RMSE = 1105.637421"
sprintf('r-squared = %f',lasso_r2)
## [1] "r-squared = 0.888094"

7. Comparison of all models

We will finally compare the three models we tested:
- Normal Linear Multiple Regression model
- Ridge regression
- Lasso Regression

name = c("Linear Model", "Ridge", "Lasso") 
rmse_list = c(lm_rmse,ridge_rmse, lasso_rmse) 
r2_list = c(lm_r2,ridge_r2,lasso_r2)
  
df_res = data.frame(name, rmse_list,r2_list )

print(df_res)
##           name rmse_list   r2_list
## 1 Linear Model  1239.318 0.8642769
## 2        Ridge  1095.329 0.8884710
## 3        Lasso  1105.637 0.8880939
barplot(height = df_res$rmse_list, names.arg = df_res$name, xlab = "Model", ylab = "Root Mean Squared Error")

barplot(height =df_res$r2_list, names.arg = df_res$name, xlab = "Model", ylab = "R-squared")

We can observe that the Ridge and Lasso regression models achieved the best R-squared and RMSE scores, and their scores were very similar.
The plain linear model had a slightly lower R-squared value and a higher RMSE.
We can conclude that Ridge and Lasso regularization indeed improved the performance of our model.

Overall, with an R-squared value of about 0.89, our model can explain around 89% of the variance, which is a solid score.
The RMSE of around 1100 is somewhat large considering that the average number of applications a university receives is 3002, but RMSE does penalize large deviations more heavily.
We cannot perfectly predict the number of applications, but we do have a solid model.
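The point about RMSE penalizing large deviations can be seen in a toy comparison with the mean absolute error (made-up numbers):

```r
# RMSE vs. MAE on errors with one large outlier: squaring makes the outlier
# dominate the RMSE, while the MAE grows only linearly with it.
errs <- c(1, 1, 1, 10)
rmse <- sqrt(mean(errs^2))   # sqrt(103/4) ~ 5.07
mae  <- mean(abs(errs))      # 13/4 = 3.25
stopifnot(rmse > mae)
```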